ABSTRACT

Finite  mixture  distributions  arise  in  many  statistical  applications. 

After  the  basic  definition  of  mixture  distributions ,  many  of  these 
applications  are  listed,  sampling  models  are  proposed  and  the  basic 
statistical  problems  are  described.  More  detailed  study  is  then  made  of  the 
use  of  the  familiar  statistical  methodologies  in  mixture  decomposition,  of  the 
incorporation  of  mixture  data  into  discrimination  procedures  and  of  the 
problems  that  arise  in  hypothesis  testing. 
SIGNIFICANCE AND EXPLANATION


Finite mixture distributions are usually characterized by a probability
density function of the form

$$p(x) = \sum_{j=1}^{k} \pi_j f_j(x),$$

where $\pi_1, \ldots, \pi_k$ are probabilities and $f_1(\cdot), \ldots, f_k(\cdot)$ are themselves
probability density functions. It is often helpful to interpret the $\{\pi_j\}$ as
prevalence rates of observations from k sources and the $\{f_j(\cdot)\}$ as the
density functions for the observed random quantity, conditional on the
source. In a typical application, in sedimentology, a sand sample is analyzed
for grain-size (giving a frequency distribution of values of x). The k
sources  correspond  to  the  constituent  minerals  of  the  sand.  Mixtures  find 
application  in  a  very  wide  number  of  applied  fields,  such  as  geology, 
fisheries  research,  medicine,  electrophoresis,  economics,  botany  and 
communications.  They  are  also  useful  as  tools  in  some  branches  of  statistical 
analysis. 

The  paper  surveys  the  methods  of  solution  of  statistical  problems  which 
arise  with  data  from  a  mixture,  possibly  supplemented  by  further  data  whose 
source  identities  are  known. 

The  most  detailed  comments  are  related  to  mixture  decomposition:  given 
data,  to  estimate  any  unknown  features  of  the  model  underlying  the  formula 
for p(x). All the standard statistical estimation procedures are discussed
and  particular  emphasis  is  placed  on  the  points  where  difficulties  arise  that 
are  peculiar  to  this  problem. 

The statistical discrimination problem usually involves the use of a
"training set" of data, whose sources are known, to develop a procedure to
aid  the  identification  of  the  source  of  a  future  observation.  The  present 
paper  investigates  the  extent  to  which  mixture  data  can  contribute  to  such  a 
discriminant  rule. 


Finally  the  problem  of  testing  for  the  number,  k,  of  components  is 
discussed.  The  interesting  feature  again  is  that,  although  a  very  familiar 


summary  lies  with  KRC,  and  not  with  the  author  of  this  report. 


SOME  PROBLEMS  WITH  DATA  FROM  FINITE  MIXTURE  DISTRIBUTIONS 


* 

D.  M.  Titterington 

1.  DEFINITION  OF  FINITE  MIXTURE  DISTRIBUTIONS 

Suppose  that  a  random  variable,  X,  takes  values  in  a  sample  space,  S,  and  that  its 
distribution  is  represented  by  a  probability  density  function  (p.d.f.)  of  the  form 

$$p(x) = \sum_{j=1}^{k} \pi_j f_j(x), \qquad (x \in S). \qquad (1)$$

Then X is said to have a finite mixture distribution. The parameters $\{\pi_j\}$ are the
mixing weights and the $\{f_j(\cdot)\}$ are the component densities. It is easy to check that
$p(\cdot)$, as defined above, is indeed a p.d.f. on S.
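
The check that a mixture of densities is itself a density can be illustrated numerically. The following is a minimal sketch, assuming two Normal components and mixing weights chosen purely for illustration (the component parameters, the helper names and the integration range are not taken from the paper):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, densities):
    """Evaluate p(x) = sum_j pi_j f_j(x), as in equation (1)."""
    return sum(w * f(x) for w, f in zip(weights, densities))

def integrate(f, lo, hi, steps=20000):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

# Illustrative (assumed) weights and component densities.
weights = [0.3, 0.7]
densities = [lambda x: normal_pdf(x, 0.0, 1.0),
             lambda x: normal_pdf(x, 4.0, 0.5)]

# The mixture should integrate to one, up to numerical error.
total_mass = integrate(lambda x: mixture_pdf(x, weights, densities), -10.0, 10.0)
```

Because the weights are nonnegative and sum to one, the total mass is the same weighted average of the component masses, each of which is one.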

Although  equation  (1)  appears  to  be  written  as  if  X  is  meant  to  be  a  univariate 
continuous  random  variable,  we  shall  subsume,  under  the  same  notation,  the  cases  of  random 
vectors and discrete data, interpreting $p(\cdot)$ and $\{f_j(\cdot)\}$ as probability mass functions
in  the  latter  case. 

If the densities $\{f_j(\cdot)\}$ are of specified parametric forms, we shall write

$$p(x) = \sum_{j=1}^{k} \pi_j f_j(x \mid \theta_j) = p(x \mid \pi, \theta) = p(x \mid \psi),$$

in which $\theta_j$ denotes the parameters relevant to $f_j(\cdot)$, $\theta$ denotes the aggregate of all
distinct parameters in $\theta_1, \ldots, \theta_k$ and $\psi$ denotes the set of all parameters in the model.

Although  there  are  a  few  exceptions  (see  Davie,  <982,  for  instance)  most  applications 
of  finite  mixtures  of  parametric  densities  involve  component  densities  of  the  same 
parametric type. In this case, $\theta_1, \ldots, \theta_k$ all belong to the same parameter space, $\Theta$,
say. We may then regard $\pi$ as defining a probability distribution over $\Theta$, and write

$$p(x \mid \psi) = \sum_{j=1}^{k} \pi_j f(x \mid \theta_j) = \int_{\Theta} f(x \mid \theta)\, dG_{\pi}(\theta), \qquad (2)$$

where $G_{\pi}(\cdot)$ denotes the probability measure on $\Theta$ defined by $\pi$.

Finite mixtures correspond to finite discrete measures $G_{\pi}(\cdot)$ and we shall be
concentrating on these. The more general notation of (2) clearly suggests the generation
of p.d.f.'s using more general probability measures on $\Theta$. These may be called general
mixtures. The formulation in (2) also clarifies the origin of the term compound
distribution, which is sometimes used instead of mixture distribution. The distribution
on S represented by $f(\cdot \mid \theta)$ is compounded with that on $\Theta$ given by $G_{\pi}(\cdot)$. If, for
instance, $f(\cdot \mid \theta)$ is a Poisson density, we obtain so-called compound Poisson
distributions.

Another revealing feature of the basic p.d.f., as given in (1), is that mixture data
can be regarded as incomplete, in a certain sense. Suppose we have a pair of random
variables (X, Y), where X has sample space S and Y is discrete, with sample space
{1, ..., k}. Suppose also that the joint p.d.f. at X = x and Y = j is factorised as

$$p(x, j) = p(j)\, p(x \mid j) = \pi_j f_j(x), \qquad (x \in S,\ j = 1, \ldots, k).$$

Then the mixture density (1) is the marginal p.d.f. for X. An observation from the
mixture can therefore be regarded as a realisation of (X, Y), but with the value of Y
missing. As we shall see, not only does this interpretation have immediate meaning in many
practical problems (in which we may have observations, each of which is known to come from
one or other of a set of k source populations, but it is not known exactly which) but it
also motivates some of the numerical methods required for parameter estimation,
particularly with maximum likelihood (Section 4.4).
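
The "missing Y" interpretation translates directly into a way of simulating mixture data: draw the source label first, then draw X from that source, and finally discard the label. The sketch below assumes two Normal sources and a mixing weight of 0.25 purely for illustration; `sample_complete` and the other names are hypothetical helpers, not part of the paper.

```python
import random

def sample_complete(n, weights, samplers, rng):
    """Draw n pairs (x, y): y is the source label, x is drawn from source y."""
    labels = list(range(1, len(weights) + 1))
    pairs = []
    for _ in range(n):
        y = rng.choices(labels, weights=weights, k=1)[0]
        x = samplers[y - 1](rng)
        pairs.append((x, y))
    return pairs

rng = random.Random(0)
weights = [0.25, 0.75]                      # assumed mixing weights
samplers = [lambda r: r.gauss(0.0, 1.0),    # assumed source 1: N(0, 1)
            lambda r: r.gauss(5.0, 1.0)]    # assumed source 2: N(5, 1)

complete = sample_complete(10000, weights, samplers, rng)

# The mixture ("incomplete") data keep x and drop the source label y.
mixture_data = [x for x, _ in complete]

# With the labels still available, the mixing weight is easy to estimate.
share_source_1 = sum(1 for _, y in complete if y == 1) / len(complete)
```

With the labels retained, the proportion from source 1 is a simple sample fraction; the statistical difficulty of mixture data comes entirely from the labels being unobserved.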
2. APPLICATIONS OF FINITE MIXTURE DISTRIBUTIONS

In this section we motivate the study of statistical methodology for dealing with data
from  mixture  distributions  by  giving  some  indication  of  the  number  and  variety  of 
applications.  These  applications  we  shall  divide  into  two  categories  to  be  called, 
somewhat  arbitrarily,  direct  and  indirect. 

In  direct  applications  there  will  be  a  belief  in  the  existence,  or  possible  existence, 
of  k  underlying  sources  from  which  the  experimental  unit  generating  X  comes.  The 
mixture  model  then  appears  directly,  as  built  up  in  the  final  paragraph  of  Section  1.  The 
following  list  of  direct  applications  is  therefore  a  list  of  applied  fields.  In  each  case 
there  is  a  "physical"  meaning  for  the  sources  or  mixture  components. 

By an indirect application we mean a circumstance in which the mixture density is
being used as a mathematical device, to facilitate the analysis in some way.

The following catalogue is intended to give a mere taste of the galaxy of applications
that may be unearthed.

2.1  DIRECT  APPLICATIONS 

(i) Sedimentology. Samples of sand are often analysed by measuring the frequency
distribution of grain sizes. The sand may be known to be a (literal!) mixture of several
minerals. It is of interest to estimate the proportions of the different minerals in the
sand. It may also be desired to estimate the grain size distributions for the different
minerals, although these may already be "known" from extensive previous survey work.

(ii) Botany. In (i) above, if for "mineral type" we write "plant type" and for "sand
grain size" we write "pollen grain size", "plant height" or "petal dimensions", then we
account for a wealth of botanical applications.

(iii) Fisheries and marine biological research. Some characteristics of a fish are
easy to measure once it has been landed; these include length but often do not include sex
(only another fish can do this easily in some species!) or age. Data on, say, fish length,
are often used for the estimation of sex proportions among a population of fish of the same
age or of the age distribution of a mixture of several years' spawnings. Figure 1, taken
from Hosmer (1973), shows a histogram of length data from a set of male and female
halibut. The separate male and female data are shown in Figure 1(a) and the mixture data
in Figure 1(b).

(iv)  Medicine.  Sometimes  data  may  be  available  from  clinical  tests  on  a  group  of 
patients, each of whom is known to be suffering from one of two diseases. It is
not, however, known exactly which disease is affecting each patient. A mixture model is
sometimes used, with the particular aim of aiding diagnosis or prognosis; see Section 5.

(v) Electrophoresis and gas chromatography. Electrophoresis is used to estimate the
relative concentrations of proteins in experimental samples and sometimes also to establish
which proteins are actually present. Figure 2, adapted from Tiselius and Kabat (1939),
shows a typical electrophoresis curve of concentration against the "migration position"
achieved relative to a common initial position by the end of the experiment. Different
proteins migrate at different rates, so the constituent proteins (in the example of Figure
2, they are albumin and α-, β- and γ-globulin) may be identified. This differs from the
other applications in that the "data" are themselves in the form of a smooth curve.

(vi) Economics. In one model for wage bargaining it is proposed (Quandt and Ramsey,
1978) that there are two possible phases, distinguished by some critical value of the cost
of living index, characterised by two different regression models. In practice it may not
be known, at any time at which data are gathered, which phase is in operation and this
leads to a statistical model which is a "mixture" of the two regressions (switching
regressions).

(vii) Communications. A sequence of messages is received, each one of which is
either a signal or just noise. The proportion of signals and the signal and noise
distributions may be of interest.

(viii) Others. Psychology, paleontology, geology, agriculture and zoology are a few
of the many other fields of application.

2.2 INDIRECT APPLICATIONS

(i) Outlier models. A mixture of k = 2 densities with one mixing weight close to
one and the other close to zero is sometimes used to model outliers. The so-called
contaminated Normal distributions form one such class. Their densities are of the form

$$\pi_1 \sigma_1^{-1} \phi((x - \mu_1)/\sigma_1) + (1 - \pi_1)\, \sigma_2^{-1} \phi((x - \mu_2)/\sigma_2), \qquad (3)$$

where $\sigma_1, \sigma_2 > 0$, $\phi(u) = (2\pi)^{-1/2} \exp\{-\tfrac{1}{2} u^2\}$ and $\pi_1$, say, is close to one. For
symmetric models $\mu_1 = \mu_2$ is imposed; see Barnett and Lewis (1978), Abraham and Box
(1978).

[Figure 2. Electrophoresis curve (for anti-egg albumin rabbit serum). Horizontal axis: Position.]

(ii) Heavy-tailed and multimodal densities. The two-component Normal mixture (3),
with $\mu_1 = \mu_2$, is one way of representing a symmetric heavy-tailed distribution. When the
means are sufficiently well separated, relative to the variances, (3) represents a bimodal
density; see Section 6.
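
The dependence of modality on the separation of the means can be explored numerically. The following sketch counts local maxima of the two-component Normal mixture density on a grid; the parameter values, grid limits and function names are assumptions made for illustration only.

```python
import math

def phi(u):
    """Standard Normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def normal_mixture_pdf(x, pi1, mu1, sigma1, mu2, sigma2):
    """Two-component Normal mixture density, equation (3)."""
    return (pi1 * phi((x - mu1) / sigma1) / sigma1
            + (1.0 - pi1) * phi((x - mu2) / sigma2) / sigma2)

def count_modes(pdf, lo, hi, steps=4000):
    """Count strict local maxima of pdf on an evenly spaced grid."""
    h = (hi - lo) / steps
    values = [pdf(lo + i * h) for i in range(steps + 1)]
    return sum(1 for i in range(1, steps)
               if values[i] > values[i - 1] and values[i] > values[i + 1])

# Equal weights and unit variances; only the separation of the means differs.
close = count_modes(
    lambda x: normal_mixture_pdf(x, 0.5, 0.0, 1.0, 1.0, 1.0), -8.0, 9.0)
far = count_modes(
    lambda x: normal_mixture_pdf(x, 0.5, 0.0, 1.0, 4.0, 1.0), -8.0, 12.0)
```

In this illustration the means one standard deviation apart give a single mode, whereas means four standard deviations apart give two. A grid search of this kind is only a rough diagnostic, since a mode lying exactly midway between two grid points could be missed.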

(iii) Cluster analysis and latent structure models.

Multivariate mixture densities (Normal-based in particular) may be used as a basis for
clustering techniques (Symons, 1981) and, in special cases, form latent structure models
(Fielding, 1977). In the latter application the problem becomes that of finding a mixture
model to fit the data. It is not essential that the components of the mixture that is
chosen have meaning as physical sources, although some interpretation may be made, in the
same spirit in which factors are interpreted in factor analysis.

(iv) Nonparametric density estimation. In the kernel method, a nonparametric
estimate of a p.d.f. $f(\cdot)$ is obtained in the form

$$\hat{f}(x) = (nh)^{-1} \sum_{i=1}^{n} K((x - x_i)/h).$$

Here h is a so-called smoothing parameter, $x_1, \ldots, x_n$ is a random sample from a
population with the distribution which gives rise to $f(\cdot)$ and $K(\cdot)$, the kernel
function, is itself a p.d.f.; see, for instance, Wegman (1972). The estimate $\hat{f}(\cdot)$ can
obviously be described as an equally-weighted mixture of n component densities.
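
To make the correspondence concrete, the sketch below evaluates a kernel estimate directly and then again as an explicit mixture with weights 1/n and component densities $h^{-1} K((x - x_i)/h)$. The Gaussian kernel, the sample values and the bandwidth are assumed for illustration.

```python
import math

def gaussian_kernel(u):
    """A kernel that is itself a p.d.f. (standard Normal)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_estimate(x, sample, h, kernel=gaussian_kernel):
    """f_hat(x) = (n h)^{-1} sum_i K((x - x_i) / h)."""
    n = len(sample)
    return sum(kernel((x - xi) / h) for xi in sample) / (n * h)

def as_mixture(sample, h, kernel=gaussian_kernel):
    """Return (weights, components) of the equivalent n-component mixture."""
    n = len(sample)
    weights = [1.0 / n] * n
    # Each component is the kernel recentred at x_i and rescaled by h.
    components = [lambda x, xi=xi: kernel((x - xi) / h) / h for xi in sample]
    return weights, components

sample = [1.2, 1.9, 2.4, 4.8, 5.1]   # hypothetical data
h = 0.5                              # hypothetical smoothing parameter
weights, components = as_mixture(sample, h)

x0 = 2.0
direct = kernel_estimate(x0, sample, h)
via_mixture = sum(w * f(x0) for w, f in zip(weights, components))
```

The two evaluations agree up to rounding, which is all the equivalence claims.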

(v) Modelling of prior densities. Mixtures can provide rich families of conjugate
prior densities in Bayesian analysis. If, for instance, each observation is distributed as
$N(\theta, 1)$ and $\theta$ is given a Normal prior, then the posterior for $\theta$ is also Normal. The
same conjugacy holds if the prior for $\theta$ is taken to be a k-component mixture of
Normals. If a "general" mixture is used we are led to hierarchical priors as in Lindley
and Smith (1972).
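
The conjugacy for mixture priors can be spelled out for a single observation. In the sketch below, each prior component $N(m_j, v_j)$ is updated by the usual Normal-Normal rule, and the component weights are rescaled by the marginal density of the observation under that component, which is $N(m_j, v_j + 1)$. The prior components and the observed value are assumed for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def update_mixture_prior(x, prior):
    """Posterior for theta after one observation x ~ N(theta, 1).

    prior is a list of (weight, mean, variance) triples describing a
    k-component Normal mixture; the posterior has the same form.
    """
    updated = []
    for w, m, v in prior:
        post_var = 1.0 / (1.0 / v + 1.0)
        post_mean = post_var * (m / v + x)
        # Marginal density of x under component j is N(m, v + 1).
        evidence = normal_pdf(x, m, v + 1.0)
        updated.append((w * evidence, post_mean, post_var))
    total = sum(w for w, _, _ in updated)
    return [(w / total, m, v) for w, m, v in updated]

prior = [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)]   # hypothetical two-component prior
posterior = update_mixture_prior(1.5, prior)
```

The observation at 1.5 is more plausible under the second prior component, so the posterior shifts weight towards it while remaining a two-component Normal mixture.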

(vi) Others. These include random number generation (Marsaglia, 1961), modelling of
error distributions (Sorenson and Alspach, 1971), manifestation in empirical Bayes methods
(Deely and Lindley, 1981), and as approximations to other distributions. Sometimes this
last application is reversed. For instance, a lognormal density may be used to approximate
to a skew mixture of two Normals; see also Smith and Naylor (1981).


3.  SAMPLING  STRUCTURES  AND  BASIC  STATISTICAL  PROBLEMS 


3.2.  Discrimination  (Pattern  Recognition). 

Given data from M0, M1 or M2, to use them for deriving discrimination procedures and
to  assess  the  worth  of  mixture  data  in  this  context. 

3.3.  Testing  for  the  number  of  components. 

Given data from M0, to find the model with the smallest number, k, of components but
which  is  still  compatible  with  the  data.  We  may,  for  instance,  wish  to  test  whether  the 
data  come  from  a  mixture  of  two  univariate  Normals  as  opposed  to  a  single  Normal.  A 
related,  but  not  equivalent,  activity  is  that  of  testing  for  the  modality  of  the  p.d.f. 

The rest of the paper discusses these objectives. Most of the space is devoted to
mixture decomposition, on which there is the most voluminous literature. In general, the
methodological principles that will be considered are very familiar and we shall be
discussing what are just particular applications of these standard procedures. What makes
the mixture problem special is that with many of the techniques there are snags, both
theoretical and computational. We shall emphasise these particularly and point out that
some of the complications remain unresolved.
4. MIXTURE DECOMPOSITION


Before launching into a catalogue of various estimation methodologies and their
application to mixture data we go through some initial questions that have to be answered
before calculations can begin.

4.1.  Preliminaries 

(i) Which sampling structure is in operation: M0, M1, M2? It is particularly
important to decide correctly whether M1 or M2 obtains. With M0 data, estimation of the
mixing weights is notoriously imprecise, so if the supplementary categorised data tell us
more about $\pi$ it can be quite a bonus.

(ii) What in the model is unknown?

(a) k? This sets the problem up essentially as one in cluster analysis.

(b) $\pi$ only? In some problems, extensive previous experience may provide
detailed knowledge about the component densities so that they may be treated as
known. This occurs in problems in sedimentology (Section 2.1) and in
remote sensing, in which aerial photographs are analysed to discover the
relative concentrations of several crops in a geographical area.

(c) $\{f_j(\cdot)\}$'s only? Sometimes the mixing weights may be, for all practical
purposes, known. A sex-ratio may sometimes be assumed to be unity, for
instance.

(d) $\pi$ and $\{f_j(\cdot)\}$'s? Perhaps the most common case.

In cases (c) and (d) the $\{f_j(\cdot)\}$'s are unknown and we have the following dilemma.

(iii) Can the $\{f_j(\cdot)\}$'s be assumed to have specified parametric forms, or not? If
the answer is yes then we may subsequently aspire that the parametric forms be simple ones,
such as Normal!

(iv) Is the class of mixtures we have chosen identifiable? This is the first of the
complications that may arise with mixture data, although it does not happen often in
practical problems. For some classes of mixtures the members of the class are not uniquely
defined and, if this is the case, estimation procedures are likely to run into
difficulties. The main culprits are some discrete distributions on finite sample spaces
and mixtures of uniform distributions.

(a) Consider mixtures of binomial distributions $B_x(N, \theta)$, with N known but
$\theta$ variable. Then the class of k-component mixtures is identifiable only if
$N \ge 2k - 1$ (Blischke, 1962).

(b) Let $U_x(a, b)$, as x varies, denote the p.d.f. for the uniform
distribution on (a, b). Then the following two-component uniform mixture
p.d.f.'s are identical:

$$\alpha\, U_x(0, \alpha) + (1 - \alpha)\, U_x(\alpha, 1), \quad \text{for all } 0 < \alpha < 1.$$

Theoretical work which reassures us that most classes of finite mixture densities of
interest are identifiable is available in various papers, including Teicher (1963),
Yakowitz and Spragins (1968) and Chandra (1977).
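
The uniform example is easy to verify numerically: whatever the value of $\alpha$, the mixture density equals one throughout (0, 1), so $\alpha$ cannot be recovered from the data. The sketch below evaluates the mixture on a small grid for a few arbitrarily chosen values of $\alpha$; the grid and the function names are assumptions made for illustration.

```python
def uniform_pdf(x, a, b):
    """Density of the uniform distribution on (a, b)."""
    return 1.0 / (b - a) if a < x < b else 0.0

def uniform_mixture_pdf(x, alpha):
    """alpha * U_x(0, alpha) + (1 - alpha) * U_x(alpha, 1)."""
    return (alpha * uniform_pdf(x, 0.0, alpha)
            + (1.0 - alpha) * uniform_pdf(x, alpha, 1.0))

# Grid points 0.05, 0.15, ..., 0.95 avoid the single point x = alpha,
# where the open-interval convention makes both components vanish.
grid = [0.05 + 0.1 * i for i in range(10)]
alphas = [0.2, 0.5, 0.8]
values = {alpha: [uniform_mixture_pdf(x, alpha) for x in grid]
          for alpha in alphas}
```

Every entry of `values` is one up to rounding, so all three mixtures coincide with the uniform density on (0, 1).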

(v) Which method of estimation to use? The decision about what technique we shall
use may well be based on our statistical philosophy but practical feasibility may also play
a part, as we shall see. We now list, by subsection, methods that have been used for the
mixture problem.

4.2. Graphical methods.
4.3. Method of moments.
4.4. Maximum likelihood.
4.5. Minimum distance methods.
4.6. Bayesian methods.
4.7. Sequential methods.
4.8. Curve fitting.

For purposes of illustration we shall restrict detailed attention to two simple
examples.

Example 1. Mixture of two known densities.

$$p(x) = \pi_1 f_1(x) + (1 - \pi_1) f_2(x) \qquad (x \in S), \qquad (5)$$

where $f_1(\cdot)$ and $f_2(\cdot)$ are known and $0 < \pi_1 < 1$.
Example 2. Mixture of two univariate Normal densities.

The p.d.f. is given by (3), which we rewrite here, for convenience:

$$p(x) = \pi_1 \sigma_1^{-1} \phi((x - \mu_1)/\sigma_1) + (1 - \pi_1)\, \sigma_2^{-1} \phi((x - \mu_2)/\sigma_2), \qquad (3)$$

where $\sigma_1 > 0$, $\sigma_2 > 0$ and $0 < \pi_1 < 1$.

As an indication of the flexibility of this as a model, we illustrate, in Figure 3,
just 6 special examples.

4.2.  Graphical  methods 

These have been used both in an exploratory way, for obtaining an informal assessment
of the number, k, of components, along with quick, if crude, parameter estimates for
subsequent numerical improvement, and also as the only method of analysis applied to the
data. The latter was common in early work in applied fields and was stimulated to some
extent by the numerical problems associated with the other methods.

The graphical methods are based on convenient plots, related to either the cumulative
distribution function or the p.d.f. itself. The most familiar of the former is the use of
Normal probability paper with Example 2. Figure 4 shows the theoretical plots for a
particular Normal mixture and its components. The corresponding plot from a set of data
can be used to assess whether the characteristic Normal mixture shape is apparent and, if
so, to provide estimates of the means and variances (from the asymptotes) and for the
mixing weight (roughly, from the point of inflexion); see Fowlkes (1979) for a useful
survey and extension of this technique.

Other plots for Example 2 have been based on the p.d.f. and one of its data-based
estimators, the histogram. First differences of the logarithms of the histogram
frequencies give local approximations to the derivatives of the logarithms of the Normal
component that is dominant at the given point. Furthermore, this derivative will be linear
with negative slope which is inversely proportional to the variance of the dominant
component. These facts form the basis of a graphical method of Bhattacharya (1967). The
quadratic nature of the logarithm of a Normal p.d.f. also stimulated a semi-graphical
method of Buchanan-Wollaston and Hodgson (1929). In general, success of these methods relies
heavily on the mixture components being fairly well separated.

[Figure 3. A selection of density functions for mixture of two univariate Normal densities.]
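
The linearity behind the first-difference plot can be checked for a single Normal component. For equally spaced points with spacing h, $\log f(x + h) - \log f(x) = -(h/\sigma^2)(x + h/2 - \mu)$, so plotting the differences against the midpoints gives a line with slope $-h/\sigma^2$ that crosses zero at $\mu$. The sketch below uses exact density values as an idealised stand-in for histogram frequencies; the parameter values and helper names are assumed for illustration and this is not a reproduction of Bhattacharya's full procedure.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Logarithm of the N(mu, sigma^2) density at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def first_differences(log_values):
    """Differences between successive log frequencies."""
    return [b - a for a, b in zip(log_values, log_values[1:])]

def fit_line(xs, ys):
    """Ordinary least squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

mu, sigma, h = 10.0, 2.0, 0.5                 # assumed component and class width
centres = [5.0 + h * i for i in range(21)]    # class centres 5.0, 5.5, ..., 15.0
log_freq = [log_normal_pdf(c, mu, sigma) for c in centres]

diffs = first_differences(log_freq)
midpoints = [c + h / 2.0 for c in centres[:-1]]
slope, intercept = fit_line(midpoints, diffs)

sigma2_hat = -h / slope          # recovers the variance
mu_hat = -intercept / slope      # recovers the mean (zero crossing of the line)
```

With real histogram counts from a mixture, the plot is only approximately linear, and only over ranges where one component dominates, which is why good separation of the components matters.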

4.3. Method of moments

Suppose $\psi$ contains s distinct parameters and that $m_1(X), \ldots, m_s(X)$ are s real-valued
functions on the sample space such that their expected values exist as independent
functions of $\psi$. Write

$$\mu_j(\psi) = E\, m_j(X), \qquad j = 1, \ldots, s.$$

Then, given $x = (x_1, \ldots, x_n)$, a random sample of size n, a set of moment estimators
$\hat\psi$ for $\psi$ can be obtained by solving

$$\mu(\hat\psi) = \bar{m}(x), \qquad (6)$$

where $\bar{m}_j(x) = n^{-1} \sum_{i=1}^{n} m_j(x_i)$, $j = 1, \ldots, s$.

If the class of distributions under investigation is identifiable, consistent
estimators of $\psi$ are usually obtained, thanks to the laws of large numbers. Asymptotic
Normality and the asymptotic covariance structure can usually be deduced from a first order
Taylor expansion of (6) into, approximately,

$$\mu(\psi) + D(\psi)(\hat\psi - \psi) = \bar{m}(x),$$

where $D(\cdot)$ denotes the matrix of first derivatives of $\mu(\psi)$. Approximately, therefore,

$$\mathrm{cov}(\hat\psi) = D(\psi)^{-1}\, \mathrm{cov}(\bar{m})\, (D(\psi)^{T})^{-1}. \qquad (7)$$

Although this is so far quite satisfying, problems do sometimes arise when attempts
are made to solve the appropriate realisation of (6). Firstly, explicit solution may not
be possible and, secondly, there may be no solution in the parameter space, or more than
one.


Example 1. Mixture of two known densities.

With only one unknown parameter, $\pi_1$, only one moment equation is required.
Furthermore, the equation will be linear in $\pi_1$, giving

$$\hat\pi_1 = (\bar{m}_1 - \mu_{12})/(\mu_{11} - \mu_{12}), \qquad (8)$$

where $\mu_{1j} = \int m_1(x) f_j(x)\, dx$, j = 1, 2. It is easy to check that $\hat\pi_1$ is unbiased for
$\pi_1$ and

$$\mathrm{var}(\hat\pi_1) = \mathrm{var}(\bar{m}_1)/(\mu_{11} - \mu_{12})^2. \qquad (9)$$

Unfortunately there is no guarantee, except asymptotically, that $0 < \hat\pi_1 < 1$,
although in this simple example this may not have great practical import. In principle,
study of the right hand side of (9) may suggest a function $m_1(\cdot)$ for which $\mathrm{var}(\hat\pi_1)$ is
small, or even minimal and which would therefore give an "optimal" moment estimator.
Although achievement of this requires knowledge of $\pi_1$ itself, some practical guideline
may well be possible in many examples.

The usual power moments are commonly used in (8), or in (6) for that matter. Another
possibility in this example is to use an indicator function for $m_1(\cdot)$. That is, take

$$m_1(X) = 1 \text{ if } X < c, \qquad m_1(X) = 0 \text{ otherwise.}$$

Then $\bar{m}_1$ is the proportion of observations in the sample < c; see Johnson (1973)
and James (1978).
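
The estimator (8) is simple enough to compute directly for either choice of $m_1(\cdot)$. The sketch below assumes, purely for illustration, that the known components are N(0, 1) and N(3, 1), that the true mixing weight is 0.3, and that the cut-off is c = 1.5; `moment_estimate` and `normal_cdf` are hypothetical helper names.

```python
import math
import random

def moment_estimate(sample, m1, mu11, mu12):
    """Equation (8): (mean of m1(x_i) - mu12) / (mu11 - mu12)."""
    m_bar = sum(m1(x) for x in sample) / len(sample)
    return (m_bar - mu12) / (mu11 - mu12)

def normal_cdf(x, mu, sigma):
    """Distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Simulated M0 data from the assumed mixture 0.3 N(0, 1) + 0.7 N(3, 1).
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) if rng.random() < 0.3 else rng.gauss(3.0, 1.0)
          for _ in range(20000)]

# Power moment m1(x) = x: mu11 and mu12 are the component means.
pi_hat_mean = moment_estimate(sample, lambda x: x, 0.0, 3.0)

# Indicator m1(x) = 1 if x < c: mu1j = P(X < c | source j).
c = 1.5
pi_hat_indicator = moment_estimate(
    sample,
    lambda x: 1.0 if x < c else 0.0,
    normal_cdf(c, 0.0, 1.0),
    normal_cdf(c, 3.0, 1.0),
)
```

Neither estimate is constrained to lie in (0, 1), in line with the caveat above; a small sample or poorly separated components can push either one outside the interval.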

Example 2. Mixture of two univariate Normals.

Possibly the earliest systematic look at mixtures was the application of the method of
moments to this example by Pearson (1894) in a study of forehead measurements of a set of
male and female crabs. We now have five parameters and Pearson used moment equations for
the first five central moments. After a certain amount of elimination of variables the
computation problem reduces (!) to that of finding a negative root of a ninth degree
polynomial, solution of which was no mean feat in the 1890's! Back-substitution then
provides the parameter estimates. Sometimes, however, the nonic has no negative root and
sometimes more than one. This is awkward, although in the former case a single Normal is
often an adequate model and in the latter, either solution is usually satisfactory. Just
how often these and other complications arise has been investigated by Bowman and Shenton
(1971).

The number of papers that are directly derivative of that of Pearson (1894) runs into
dozens, with many applications and modifications of the method of solution. For the
special case of $\sigma_1 = \sigma_2$, the "nonic" is replaced by a cubic (Cohen, 1967) and a neat
graphical aid for this case is given by Preston (1953). The bivariate case is mentioned by
Charlier and Wicksell (1924) and the multivariate by Day (1969) and John (1970).

Mixtures of the other simple parametric distributions have also been given the method-
of-moments treatment. Several of them, in which the component densities are one-parameter
p.d.f.'s, lead to a general pattern in which there is a set of moment equations of the form

$$\sum_{j=1}^{k} \pi_j \theta_j^{s} = c_s, \qquad s = 0, 1, \ldots, 2k - 1, \qquad (10)$$

where $\theta_j$ is the (scalar) parameter associated with the jth component density, $c_0 = 1$
and the other $\{c_s\}$ are data-based.

They include mixtures of exponentials (based on ordinary power moments), binomials and
negative binomials (both based on weighted factorial moments, with $\theta$ as the "success
probability"), Poissons (factorial moments), one-parameter Weibulls and one-parameter
gammas (both based on weighted power moments), and a generalised method of moments due to
Kabir (1966). For an illustration of the standard method of solution of equations like
(10), see Blischke (1964).

As in Example 1, (7) may, in principle, be used to select "optimal" moments for use in
more general problems. Tallis and Light (1968) discuss the choice of fractional power
moments so as to minimise det $\mathrm{cov}(\hat\psi)$, as given by (7), for a mixture of two exponentials.

4.4. Maximum likelihood

For a given parametric mixture model, the method of maximum likelihood is available.
That there are difficulties is immediately apparent if we look at the M0 likelihood given
by equation (4). Almost certainly the order statistic (in the case of univariate
continuous data) will be minimal sufficient and explicit MLE's will not be available.
Numerical optimisation will be necessary although, in many cases, maximum likelihood
analysis of the "complete" categorised version of the data may be very easy, as would be
the case for both our special examples.
Example 1

$$L_0 = L_0(\pi_1) = \prod_{i=1}^{n} \{\pi_1 f_{i1} + (1 - \pi_1) f_{i2}\},$$

where $f_{ij} = f_j(x_i)$, i = 1, ..., n, j = 1, 2. Thus

$$\partial \log L_0/\partial \pi_1 = \sum_{i=1}^{n} (f_{i1} - f_{i2})/p(x_i) \qquad (11)$$

and

$$\partial^2 \log L_0/\partial \pi_1^2 = -\sum_{i=1}^{n} (f_{i1} - f_{i2})^2/p(x_i)^2. \qquad (12)$$

Although we see, from (11), that the likelihood equation is a polynomial equation of
degree up to (n - 1) in $\pi_1$, equation (12) shows that $\log L_0$ is strictly concave in
$\pi_1$, so that there is at most one real root, $\tilde\pi_1$, of (11) and it gives a global maximum
of $L_0$. It is easy to check whether $0 < \tilde\pi_1 < 1$ and thus determine the maximum
likelihood estimate, $\hat\pi_1$, say. Peters and Coberly (1976) generalise this to a version of this
example with more than two components.

Even with this simple problem, however, there is a complication, which arises in the
asymptotic theory of maximum likelihood. It is fairly easy to discover that, if the true
value of $\pi_1$ is 1 then, asymptotically, $\hat\pi_1 = 1$ with probability 1/2. Thus $\hat\pi_1$ is
not asymptotically Normal, although it is consistent. The standard theory fails because
the true $\pi_1$ is on the boundary of the parameter space.
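
Because the score (11) is strictly decreasing in the mixing weight, the constrained maximiser can be found by checking the two end points and otherwise bisecting for the root. The following is a minimal sketch of that idea, assuming the density values $f_1(x_i)$ and $f_2(x_i)$ have already been computed and are strictly positive; the function names and the toy inputs are hypothetical.

```python
def score(pi1, f1, f2):
    """Derivative of log L_0 with respect to pi_1, equation (11)."""
    return sum((a - b) / (pi1 * a + (1.0 - pi1) * b) for a, b in zip(f1, f2))

def mle_mixing_weight(f1, f2, tol=1e-12):
    """Maximise log L_0 over 0 <= pi_1 <= 1.

    f1[i] = f_1(x_i) and f2[i] = f_2(x_i) are assumed strictly positive,
    so that the score is finite at both end points.
    """
    if score(0.0, f1, f2) <= 0.0:
        return 0.0          # log L_0 is non-increasing on [0, 1]
    if score(1.0, f1, f2) >= 0.0:
        return 1.0          # log L_0 is non-decreasing on [0, 1]
    lo, hi = 0.0, 1.0       # score(lo) > 0 > score(hi): bisect for the root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, f1, f2) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy inputs: a symmetric case with an interior root, and a case in which
# every observation favours component 1 so the maximum is on the boundary.
interior = mle_mixing_weight([2.0, 1.0], [1.0, 2.0])
boundary = mle_mixing_weight([2.0, 2.0], [1.0, 1.0])
```

In the first toy case the score vanishes at 0.5; in the second it is positive throughout [0, 1], so the maximum is attained at the boundary value 1, the situation in which the standard asymptotic theory breaks down.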

Example  2 

b  .  . 

tQ  *  b0(i>  -  »  («,<>,  ♦u*i  -  «,)/<>,>  ♦  <1  -  *,)°2  ♦<(*i  -  Uj)/oa>)  • 

The two-component univariate Normal mixture is by far the most commonly researched or applied case and yet its likelihood surface is a potential disaster area. It is riddled with singularities. If we set, say, $\mu_1 = x_1$, then it is easy to see that, as $\sigma_1 \to 0$, $L_0 \to \infty$. Furthermore, there are many reported cases of weird features on the likelihood surfaces, quite apart from the problem of singularities; see for instance the related Figure 1 of Hartigan (1977). They include multiple maxima, unusual troughs and unusual behaviour at the boundary of the parameter space. In spite of this, the method of maximum likelihood is used in practice for this problem and Kiefer (1978) has even established the existence of a local maximum of $L_0$ for which the usual asymptotic theory holds. To some extent the difficulties are lessened if there are supplementary categorized data or if the parameter space is restricted, by demanding that $\sigma_1 = \sigma_2$, for instance. As far as the computation of maximum likelihood estimates is concerned, we may employ traditional numerical methods, of which the most familiar to statisticians are the method of Newton-Raphson and its derivative, the method of scoring, both of which calculate, automatically, via a Hessian matrix, an estimate of the asymptotic covariance matrix. Boes (1967) discusses a one-stage method of scoring for Example 1. If the initial estimator is consistent, then the first iterate is Best Asymptotically Normal.
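The singularity in Example 2 is easy to exhibit numerically. The sketch below is illustrative only: the six data values, the fixed values $\pi_1 = 0.5$, $\mu_2 = 0.5$, $\sigma_2 = 1$ and the function names are assumptions made for the demonstration. It fixes $\mu_1$ at the first observation and lets $\sigma_1$ shrink.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def log_lik(xs, pi1, mu1, sigma1, mu2, sigma2):
    """log L_0 for the two-component univariate Normal mixture."""
    return sum(
        math.log(pi1 * normal_pdf(x, mu1, sigma1)
                 + (1.0 - pi1) * normal_pdf(x, mu2, sigma2))
        for x in xs
    )


def singularity_path(xs, sigmas, pi1=0.5, mu2=0.5, sigma2=1.0):
    """log L_0 with mu_1 fixed at the first observation, for each sigma_1 given."""
    return [log_lik(xs, pi1, xs[0], s, mu2, sigma2) for s in sigmas]


xs = [-1.2, -0.4, 0.1, 0.7, 1.5, 2.3]  # illustrative data, not from the paper
sigmas = (1.0, 1e-2, 1e-4, 1e-8)
for s, ll in zip(sigmas, singularity_path(xs, sigmas)):
    print(f"sigma_1 = {s:g}: log L_0 = {ll:.3f}")
```

Once $\sigma_1$ is small, each further reduction of $\sigma_1$ by a factor $c$ adds very nearly $\log c$ to the log-likelihood, so the printed values increase without bound.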

It is also possible to exploit our interpretation of mixture data as being "incomplete" and use a version of the EM (Expectation-Maximisation) algorithm of Dempster et al. (1977). The algorithm generates a sequence of estimates $\{\psi^{(r)}\}$ of $\psi$ for which the corresponding sequence of likelihoods is monotonic increasing. Although it can be slow to converge, the algorithm is usually very easy to program. Many of its manifestations, including those related to mixture problems, appeared in much earlier papers as appealing successive-approximations procedures, without the general structure or simple proof of monotonicity being spotted. In Section 1 we interpreted the observed mixture data $x$ (M0 data) as originating from a complete data-set

$$\{(x_1, y_1), \ldots, (x_n, y_n)\} = (x, y),$$

but with the source identifiers $y_1, \ldots, y_n$ missing. The two-step iterative stage of the EM algorithm is as follows, in which $g$ denotes the p.d.f. for the complete data. We suppose that parameter estimates $\psi^{(r)}$ are currently available, to be improved upon to give $\psi^{(r+1)}$. Hopefully, as $r \to \infty$, $\psi^{(r)} \to \hat{\psi}$.

E-step: Evaluate $E\{\log g(x, y \mid \psi) \mid x, \psi^{(r)}\} = Q(\psi, \psi^{(r)})$, say.

M-step: Find $\psi = \psi^{(r+1)}$ to maximise $Q(\psi, \psi^{(r)})$.

Details of the general EM algorithm for finite mixtures are given in Section 4.3 of Dempster et al. (1977), where it is found more convenient to express the source identifiers in terms of indicator vectors. Here we show the appealing forms for our two examples.
Example 1.

E-step: Given $\pi_1^{(r)}$, let

$$w_{ij}^{(r)} = \pi_j^{(r)} f_j(x_i) / p^{(r)}(x_i), \quad i = 1, \ldots, n, \; j = 1, 2,$$

where $\pi_2^{(r)} = 1 - \pi_1^{(r)}$ and $p^{(r)}(x) = \pi_1^{(r)} f_1(x) + (1 - \pi_1^{(r)}) f_2(x)$.

M-step:

$$\pi_1^{(r+1)} = n^{-1} \sum_{i=1}^{n} w_{i1}^{(r)}.$$

Note how, in the E-step, the $n$ observations are "allocated" to the two components by fractions which are current estimates of predictive probabilities. That is,

$$w_{ij}^{(r)} = \mathrm{Prob}(y_i = j \mid x_i, \pi_1^{(r)}).$$

In the M-step, $\pi_1^{(r+1)}$ is obtained as a "relative frequency" based on aggregating these fractions. The categorized-data version would have all $w_{ij}$'s as zero or unity.
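The E-step and M-step for Example 1 can be sketched as follows. This is an illustrative implementation only: the function name, the starting value and the fixed number of iterations are assumptions, and the inputs `f1[i]` and `f2[i]` stand for the known densities $f_1(x_i)$ and $f_2(x_i)$. The log-likelihood is recorded before each update so that the monotonicity property can be inspected.

```python
import math


def em_mixing_weight(f1, f2, pi1=0.5, iters=100):
    """EM for the mixing weight pi_1 when both component densities are known.

    f1[i] = f_1(x_i) and f2[i] = f_2(x_i). Returns the final estimate and the
    log-likelihood evaluated at the start of each iteration.
    """
    n = len(f1)
    logliks = []
    for _ in range(iters):
        # Mixture density p^(r)(x_i) at the current estimate.
        p = [pi1 * a + (1.0 - pi1) * b for a, b in zip(f1, f2)]
        logliks.append(sum(math.log(v) for v in p))
        # E-step: weights w_i1 = pi_1 f_1(x_i) / p(x_i).
        w1 = [pi1 * a / v for a, v in zip(f1, p)]
        # M-step: the new estimate is the average weight.
        pi1 = sum(w1) / n
    return pi1, logliks
```

A simple stopping rule based on the change in `pi1` could replace the fixed iteration count; the fixed count is used here only to keep the sketch short.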
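Example 2.

E-step: Given $\psi^{(r)} = (\pi_1^{(r)}, \mu_1^{(r)}, \sigma_1^{(r)}, \mu_2^{(r)}, \sigma_2^{(r)})$, let

$$w_{ij}^{(r)} = \pi_j^{(r)} f_{ij}^{(r)} / p^{(r)}(x_i), \quad i = 1, \ldots, n, \; j = 1, 2,$$

where

$$p^{(r)}(x_i) = \sum_{j=1}^{2} \pi_j^{(r)} f_{ij}^{(r)}, \quad i = 1, \ldots, n,$$

and

$$f_{ij}^{(r)} = (\sigma_j^{(r)})^{-1} \phi\big((x_i - \mu_j^{(r)})/\sigma_j^{(r)}\big), \quad \text{for each } i, j.$$

Again the $\{w_{ij}^{(r)}\}$ are current predictive probabilities.

M-step: For $j = 1, 2$,

$$\pi_j^{(r+1)} = n^{-1} \sum_{i=1}^{n} w_{ij}^{(r)},$$

$$\mu_j^{(r+1)} = \sum_{i=1}^{n} w_{ij}^{(r)} x_i \Big/ \sum_{i=1}^{n} w_{ij}^{(r)}$$

and

$$(\sigma_j^{(r+1)})^2 = \sum_{i=1}^{n} w_{ij}^{(r)} (x_i - \mu_j^{(r+1)})^2 \Big/ \sum_{i=1}^{n} w_{ij}^{(r)}.$$

Note the similarity of the M-step to the calculations for fully-categorized data.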
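The recursions for Example 2 translate almost line by line into code. The sketch below is a bare illustration under stated assumptions: the function name and argument order are hypothetical, the number of iterations is fixed, and no safeguard against the singularities mentioned earlier is included, so sensible starting values are needed.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_normal_mixture(xs, pi1, mu1, sd1, mu2, sd2, iters=100):
    """EM for a two-component univariate Normal mixture.

    Returns (pis, mus, sds, logliks); logliks[r] is the log-likelihood at the
    parameter values held at the start of iteration r.
    """
    n = len(xs)
    pis, mus, sds = [pi1, 1.0 - pi1], [mu1, mu2], [sd1, sd2]
    logliks = []
    for _ in range(iters):
        # E-step: predictive probabilities w[i][j].
        w = []
        ll = 0.0
        for x in xs:
            terms = [pis[j] * normal_pdf(x, mus[j], sds[j]) for j in range(2)]
            p = terms[0] + terms[1]
            ll += math.log(p)
            w.append([t / p for t in terms])
        logliks.append(ll)
        # M-step: weighted relative frequency, mean and variance per component.
        for j in range(2):
            wj = sum(wi[j] for wi in w)
            pis[j] = wj / n
            mus[j] = sum(wi[j] * x for wi, x in zip(w, xs)) / wj
            var = sum(wi[j] * (x - mus[j]) ** 2 for wi, x in zip(w, xs)) / wj
            sds[j] = math.sqrt(var)
    return pis, mus, sds, logliks
```

Setting every weight to zero or one in the M-step reproduces the usual sample proportion, mean and variance of each category, which is the similarity noted above.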

Similar simple recursions are available for mixtures of other parametric distributions such as exponentials, Poissons and their generalization, the exponential family. Wolfe (1970) and Day (1969) give the EM algorithm for multivariate Normal mixtures, Skene (1978) that for latent class analysis, Hartley (1978) that for the switching-regressions model and Baum et al. (1971) that for the Markov chain case referred to in Section 3.

Many other families of mixtures have had their maximum likelihood methodology dealt with by this or other algorithms. They include binomials and others (Hasselblad, 1969), truncated exponentials (Mendenhall and Hader, 1958), uniforms (Gupta and Miyawaki, 1978), von Mises (Mardia and Sutton, 1975), logistics (Anderson, 1979) and even the compound Poisson distribution (Simar, 1976).

A related approach is the so-called 'cluster analysis' method. For Example 2 this amounts to the following. Consider all $2^n$ partitions of the data into two clusters. For each partition, maximise the likelihood and choose that partition and corresponding parameter estimates which give a global maximum. Symons (1981) emphasises that the major usefulness of this method and its multivariate version is in cluster construction as opposed to parameter estimation, in which obvious biases occur. In the univariate Normals case, Example 2, the optimal partition corresponds to some cut-off value $c$, say, such that all $x_i < c$ go into one component and the rest into the other. That the resulting variance estimates, say, are biased is quite clear.
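Because the optimal partition is described by a cut-off, the search over $2^n$ partitions reduces to a scan over the sorted data. The sketch below relies on that cut-off property as stated in the text. The criterion (a classification log-likelihood with cluster proportions $n_j/n$), the minimum cluster size and the function name are assumptions of this illustration. The minimum size keeps the scan away from degenerate clusters with near-zero variance.

```python
import math


def best_cutoff_partition(xs, min_size=5):
    """Scan all cut-offs of the sorted data and keep the best two-cluster split.

    For each split, every cluster gets its own proportion n_j/n, mean and
    variance (the maximising values for that partition). Returns
    (log-likelihood, cut-off, [(proportion, mean, sd), ...]) or None.
    """
    xs = sorted(xs)
    n = len(xs)
    best = None
    for m in range(min_size, n - min_size + 1):
        ll = 0.0
        params = []
        usable = True
        for group in (xs[:m], xs[m:]):
            nj = len(group)
            mu = sum(group) / nj
            var = sum((x - mu) ** 2 for x in group) / nj
            if var <= 0.0:  # all values equal: likelihood would be unbounded
                usable = False
                break
            # Maximised Normal log-likelihood of the cluster plus its
            # contribution n_j log(n_j / n) from the cluster proportion.
            ll += nj * math.log(nj / n) - 0.5 * nj * (math.log(2.0 * math.pi * var) + 1.0)
            params.append((nj / n, mu, math.sqrt(var)))
        if usable and (best is None or ll > best[0]):
            cutoff = 0.5 * (xs[m - 1] + xs[m])
            best = (ll, cutoff, params)
    return best
```

Each cluster here is a truncated portion of the data, so its sample variance tends to understate the variance of the underlying component when the components overlap, which is the bias referred to above.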

4.5  Minimum  distance  estimation 

A wide variety of estimation procedures may be envisaged which can be interpreted informally as the minimisation of

$$\delta(\text{data}, \text{theoretical distribution})$$

over the second argument, where $\delta$ is some measure of difference or distance. More formally, we may choose $\psi$ to minimise

$$\delta(\hat{F}, F_\psi),$$

where $F_\psi$ is the theoretical cumulative distribution function and $\hat{F}$ is some data-based version, the most natural being the empirical distribution function. All sorts of $\delta$ may be chosen, some of them metrics, some not, and indeed the previously mentioned methods of moments and maximum likelihood can be described in these terms. The latter corresponds to the Kullback-Leibler directed divergence

$$\delta_{KL}(F, G) = \int \log\big(dF(x)/dG(x)\big)\, dF(x),$$

where $dF(x)/dG(x)$ is a ratio of 'densities'. Given data of type M0, the part of $\delta_{KL}(\hat{F}, F_\psi)$ depending on $\psi$ is

$$-\int \log\big(p(x \mid \psi)\big)\, d\hat{F}(x) = -n^{-1} \sum_{i=1}^{n} \log p(x_i \mid \psi) = -n^{-1} \log L_0.$$

Other special versions are

$$\delta_C(F, G) = \int \{f(x) - g(x)\}^2 g(x)^{-1}\, dx$$

and

$$\delta_{MC}(F, G) = \delta_C(G, F).$$

Discrete versions of these give the methods of minimum chi-squared and minimum modified chi-squared, the former of which was used by Fryer and Robertson (1972) for Normal mixtures using grouped data.

The quadratic distance function

$$\delta_Q(F, G) = \int \big(F(x) - G(x)\big)^2\, dF(x)$$

is useful, particularly for our Example 1.

Example 1.

$$\delta_Q(\hat{F}, F_\psi) = n^{-1} \sum_{i=1}^{n} \Big( \sum_{j=1}^{2} \pi_j F_j(x_{(i)}) - i/n \Big)^2,$$

where $F_j(\cdot)$ is the cumulative distribution function from $f_j(\cdot)$ and $x_{(1)} \le \ldots \le x_{(n)}$ are the ordered observations. We have to minimise, therefore, a quadratic function of $\pi_1, \pi_2$, subject to $\pi_1 + \pi_2 = 1$, $\pi_1 \ge 0$, $\pi_2 \ge 0$. If the nonnegativity constraints are ignored, explicit solution is possible for the $\pi_j$; see Macdonald and Pitcher (1979), for instance.
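For two components the explicit solution follows from substituting $\pi_2 = 1 - \pi_1$, which leaves a quadratic in $\pi_1$ alone. Writing $d_i = F_1(x_{(i)}) - F_2(x_{(i)})$, the unconstrained minimiser is $\sum_i d_i \{i/n - F_2(x_{(i)})\} / \sum_i d_i^2$. The sketch below computes it; the function names are hypothetical, and the clipping to $[0, 1]$ is an addition of this illustration, valid because the criterion is a quadratic in one variable.

```python
import math


def normal_cdf(x, mu, sigma):
    """Distribution function of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def min_quadratic_distance_pi1(xs, cdf1, cdf2):
    """Minimum quadratic-distance estimate of pi_1 with known component c.d.f.s.

    Returns (unconstrained minimiser, minimiser clipped to [0, 1]).
    """
    xs = sorted(xs)
    n = len(xs)
    num = 0.0
    den = 0.0
    for i, x in enumerate(xs, start=1):
        d = cdf1(x) - cdf2(x)          # coefficient of pi_1
        num += d * (i / n - cdf2(x))   # empirical c.d.f. minus F_2
        den += d * d
    raw = num / den
    return raw, min(1.0, max(0.0, raw))
```

For example, `min_quadratic_distance_pi1(data, lambda x: normal_cdf(x, 0, 1), lambda x: normal_cdf(x, 3, 1))` estimates the weight of an N(0,1) component against an N(3,1) component, both of which are illustrative choices.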

When explicit solution is not possible, numerical solution is required. A first-order Taylor expansion of the stationarity equations can be made the basis for asymptotic results, as in the method of moments or maximum likelihood. In particular, asymptotic covariance matrices may be obtained.

A modification of the basic technique is to minimise a distance measure between, not $\hat{F}$ and $F_\psi$, but $\hat{T}_u$ and $T_u(\psi)$, say, where $T_u(\psi)$ is some transform of $F_\psi$, with auxiliary variable $u$, and $\hat{T}_u$ is the empirical version. The distance measure depends on $u$, clearly. One approach is to impose a weighting measure, $W(u)$, on the range of $u$ and to minimize

$$\Delta_W(\psi) = \int \delta\big(\hat{T}_u, T_u(\psi)\big)\, dW(u).$$

Quandt and Ramsey (1978) use this method with

(i) $\delta$ quadratic;

(ii) $W(\cdot)$ a measure with finite support;

(iii) the moment generating function.

They apply the technique to Normal mixtures and switching regressions. Kumar et al. (1979) use the characteristic function with a continuous measure for $W(\cdot)$. So far, little has been said about the obvious problem of choosing an "optimal" measure $W(\cdot)$, as far as the asymptotic covariance matrix, $\mathrm{cov}(\hat{\psi})$, say, is concerned. It corresponds to the choice of optimal moment equations in Section 4.3.

A slightly different use of distance functions is that of Hall (1981), for estimating mixing weights when there are data available from the mixture, providing empirical c.d.f. $\hat{F}$, and from the $k$ component distributions, giving empirical c.d.f.'s $\hat{F}_1, \ldots, \hat{F}_k$. The $\pi_1, \ldots, \pi_k$ are chosen to minimize

$$\delta\Big(\hat{F}, \sum_{j=1}^{k} \pi_j \hat{F}_j\Big).$$

As in the treatment of Example 1 above, the use of a quadratic $\delta$ gives explicit minimisation, if the nonnegativity constraints are ignored. For this essentially nonparametric technique, Hall (1981) derives asymptotic theory. Titterington (1983) looks at versions for discrete and smoothed continuous data.

4.6.  Bayesian  method 

There is usually a strong similarity between the relative ease that is possible with likelihood inference and Bayesian methods. In principle the Bayesian approach promises to be the more amenable with mixture data. In practice we run into difficulty again, as illustrated below with M0 data.

Example 1

$$L_0 = \prod_{i=1}^{n} \{\pi_1 f_1(x_i) + (1 - \pi_1) f_2(x_i)\} = \sum_{y} g(x, y), \qquad (13)$$

where the summation is over all possible $y$ ($2^n$ terms). $L_0$, therefore, is the sum of $2^n$ likelihoods each of which corresponds to categorized data. If categorized data are easy to deal with in Bayesian analysis (in other words, if there is a convenient, conjugate family of priors) then the same will be true for mixture data. In this example, if a Beta prior is available for $\pi_1$ then the posterior density for $\pi_1$ will be that of a calculable mixture of Betas. Unfortunately, the number of mixture components is $2^n$, which quickly becomes large with $n$. If the number of mixture components were $k$, we would end up with a $k^n$-component mixture for the posterior for $\pi$.
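For very small $n$ the $2^n$-term posterior in Example 1 can simply be enumerated. The sketch below is illustrative only: the function names and the Be$(a, b)$ prior parameters are assumptions. An allocation $y$ that places $m$ observations in component 1 contributes a Be$(a + m, b + n - m)$ component, with weight proportional to the product of the relevant densities times the Beta function $B(a + m, b + n - m)$.

```python
import itertools
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def posterior_beta_mixture(f1, f2, a=1.0, b=1.0):
    """Exact posterior for pi_1 as a list of (weight, alpha, beta) triples.

    f1[i] = f_1(x_i), f2[i] = f_2(x_i); the prior is Be(a, b). One Beta
    component is produced for each of the 2^n allocations of the n
    observations to the two sources, so this is only usable for small n.
    """
    n = len(f1)
    comps = []
    for y in itertools.product((1, 2), repeat=n):
        m = sum(1 for yi in y if yi == 1)  # number allocated to component 1
        c = 1.0
        for yi, u, v in zip(y, f1, f2):
            c *= u if yi == 1 else v
        weight = c * math.exp(log_beta(a + m, b + n - m))
        comps.append((weight, a + m, b + n - m))
    total = sum(w for w, _, _ in comps)
    return [(w / total, al, be) for w, al, be in comps]


def posterior_mean(comps):
    """Posterior mean of pi_1 from the mixture-of-Betas representation."""
    return sum(w * al / (al + be) for w, al, be in comps)
```

With $n = 8$ there are already 256 components, and each extra observation doubles the count, which is the storage problem described above.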

Example 2

Here the natural prior structure is to have $\pi_1$, $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$ mutually independent. Then "as usual" choose a Beta prior for $\pi_1$ and a Normal/inverse Gamma prior for each of $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$. Again exact results may be written down in terms of $2^n$-component mixtures for joint and marginal posterior p.d.f.'s.

Various ways of coping with this computational and storage problem have been considered.

(i) If only posterior expected values are of interest, use numerical integration based on (13). This may not, however, be very helpful in some circumstances. If, for instance, a posterior density is multimodal, then the posterior mean may be an unhelpful index of location. Numerical integration may however be the way to calculate predictive densities, as given by

$$q(z) = E\, p(z \mid \psi) = \int p(z \mid \psi)\, p(\psi \mid x)\, d\psi,$$

where $p(\cdot \mid x)$ denotes the posterior p.d.f. for $\psi$.

(ii) Neglect terms in the posterior which are known to be small. When a contamination model is used for outliers, $\pi_1$ is considered to be close to 1. Only those terms in (13) with small powers of $(1 - \pi_1)$ are retained and the posterior p.d.f. is renormalized appropriately (Box and Tiao, 1968; Abraham and Box, 1978).

(iii) Select a (comparatively) small number of the $2^n$ terms at random, evaluate them and renormalize (Leonard, 1982).

(iv) If only the predictive density, say, is of interest and not the parameters themselves, replace the mixture density by another with similar characteristics but which is more amenable to practical Bayesian analysis (Smith and Naylor, 1981).

(v) Use an approximate method based on sequential incorporation of the data (Section 4.7).

A Bayesian version of the "cluster analysis" approach (see end of Section 4.4) is discussed by Binder (1978).

4.7.  Sequential  methods 

There is an important class of methods in which the data, $x_1, \ldots, x_n$, are treated sequentially and which lead to ways of decomposing the mixture approximately. Many of the procedures, particularly related to Example 1 (known component densities), were developed in the electrical engineering literature and, consequently, introduce a new jargon. The decomposition problem itself is called that of unsupervised learning, in that we have to process M0 data without being told the whole story, namely, the identities of the sources. In the engineering context, the sequential nature of the analysis serves the need to process, on-line, data which become available sequentially. When such methods have been developed in the statistical literature there has also been the principle of trying to obviate the computational difficulties implicit in maximum likelihood and Bayesian analysis, as we shall see. We shall use Example 1 to illustrate four procedures.

(i) Decision directed (DD).

(ii) Learning with a probabilistic teacher (PT).

(iii) Quasi maximum likelihood (QML).

(iv) Quasi Bayes (QB).


Example  1 

Suppose, after $r$ observations have been dealt with, the "current" estimate of $\pi_1$ is $\pi_1^{(r)}$. Given the next observation, $x_{r+1}$, we evaluate (cf. Section 4.4) weights

$$w_1^{(r+1)} = \frac{\pi_1^{(r)} f_1(x_{r+1})}{\pi_1^{(r)} f_1(x_{r+1}) + (1 - \pi_1^{(r)}) f_2(x_{r+1})}$$

and

$$w_2^{(r+1)} = 1 - w_1^{(r+1)}.$$

These weights have possible application (see Section 5) in the classification of $x_{r+1}$ into one or other of the two component populations. The procedures develop quite naturally from this interpretation, particularly the first three.

DD: Assign observation $r + 1$ to component 1 (resp. 2) if $w_1^{(r+1)} >$ (resp. $<$) $w_2^{(r+1)}$.

PT: With probability $w_j^{(r+1)}$ assign observation $r + 1$ to component $j$, $j = 1, 2$.

QML: Assign a "fraction" $w_j^{(r+1)}$ of observation $(r + 1)$ to component $j$, $j = 1, 2$.

This leads to the following recursive algorithms, stated here in forms which fit in with comments later on.

DD: If $w_1^{(r+1)} > w_2^{(r+1)}$,

$$\pi_1^{(r+1)} = \pi_1^{(r)} - (r + 1)^{-1} (\pi_1^{(r)} - 1). \qquad (14)$$

If $w_1^{(r+1)} < w_2^{(r+1)}$,

$$\pi_1^{(r+1)} = \pi_1^{(r)} - (r + 1)^{-1} \pi_1^{(r)}. \qquad (15)$$

PT: With probability $w_1^{(r+1)}$, (14) holds; otherwise, (15) holds.

QML: For $j = 1, 2$,

$$\pi_j^{(r+1)} = \pi_j^{(r)} - (r + 1)^{-1} (\pi_j^{(r)} - w_j^{(r+1)}). \qquad (16)$$

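In the QB approach the rationale is to maintain a Beta density for $\pi_1$ at each stage and a recursion is set up on the mean, for which we use the notation $\pi_1^{(r)}$. If, at stage $r$, $\pi_1 \sim \mathrm{Be}(\alpha_r, \beta_r)$, so that $\pi_1^{(r)} = \alpha_r / (\alpha_r + \beta_r)$, then the distribution of $\pi_1$ at stage $r + 1$ ought to be a mixture of a $\mathrm{Be}(\alpha_r + 1, \beta_r)$ and a $\mathrm{Be}(\alpha_r, \beta_r + 1)$. Instead, we approximate to this mixture by a single Beta, with parameters $\alpha_r + w_1^{(r+1)}$ and $\beta_r + w_2^{(r+1)}$, with $(w_1^{(r+1)}, w_2^{(r+1)})$ defined in terms of $\pi_1^{(r)}$ exactly as above. We obtain

$$\pi_j^{(r+1)} = \pi_j^{(r)} - (\alpha_r + \beta_r + 1)^{-1} (\pi_j^{(r)} - w_j^{(r+1)}), \quad j = 1, 2. \qquad (17)$$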


Obviously the results will depend on the order in which the data are incorporated but the on-line facility may over-ride this criticism. The important theoretical question is whether convergence can be guaranteed of $\pi_1^{(r)}$ to the true $\pi_1$ as $r \to \infty$ ($n \to \infty$). The recursions (14)-(17) have been written in forms which suggest that the key will lie in the theory of stochastic approximations (Wasan, 1969). For the DD method it is known that, sometimes, the sequence $\{\pi_1^{(r)}\}$ may "runaway" to a value other than the true $\pi_1$. For the other methods, consistency can be established. Similar sequential procedures may be set up for more complicated mixtures (Smith and Makov, 1978; Titterington, 1976; Titterington and Jiang, 1981). A useful survey is provided by Makov (1980).
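The four recursions (14)-(17) share the form "new estimate = old estimate minus gain times (old estimate minus target)" and differ only in the target and the gain. The sketch below makes this explicit. It is illustrative only and all names are hypothetical. One assumption departs from the formulas as written: for DD, PT and QML the gain is taken as $(r_0 + r + 1)^{-1}$ with a default $r_0 = 1$, as if the starting value were based on $r_0$ earlier observations. With $r_0 = 0$, which gives exactly the gain of (14)-(16), the first DD or PT step would move the estimate to 0 or 1 and it could never leave.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def sequential_pi1(xs, f1, f2, method, pi0=0.5, r0=1.0,
                   alpha0=1.0, beta0=1.0, seed=0):
    """One pass of DD, PT, QML or QB through the data, in the order given.

    f1 and f2 are the known component densities (callables). pi0 and r0 are
    used by DD, PT and QML; alpha0 and beta0 define the Beta prior for QB.
    """
    rng = random.Random(seed)  # only the PT method draws random numbers
    alpha, beta = alpha0, beta0
    pi1 = alpha / (alpha + beta) if method == "QB" else pi0
    for r, x in enumerate(xs):
        a = pi1 * f1(x)
        b = (1.0 - pi1) * f2(x)
        w1 = a / (a + b)  # weight w_1^(r+1); w_2^(r+1) is 1 - w1
        if method == "DD":
            target = 1.0 if w1 > 0.5 else 0.0  # a tie goes to component 2 here
            gain = 1.0 / (r0 + r + 1.0)
        elif method == "PT":
            target = 1.0 if rng.random() < w1 else 0.0
            gain = 1.0 / (r0 + r + 1.0)
        elif method == "QML":
            target = w1
            gain = 1.0 / (r0 + r + 1.0)
        elif method == "QB":
            target = w1
            gain = 1.0 / (alpha + beta + 1.0)
            alpha += w1          # single-Beta approximation to the
            beta += 1.0 - w1     # two-component Beta mixture
        else:
            raise ValueError("method must be 'DD', 'PT', 'QML' or 'QB'")
        pi1 -= gain * (pi1 - target)
    return pi1
```

Written this way, QML and QB coincide whenever $\alpha_0 + \beta_0 = r_0$ and the starting means agree, so in this sketch the two differ only in how much weight the starting value carries.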

4.8. Curve fitting

So far we have given no indication of how to analyse (exactly the right word here!) the electrophoresis curve of Figure 2. Here the data are themselves a smooth curve. In electrophoretic practice informal methods are sometimes used for estimating the relative concentrations of the proteins. The area under the curve is divided up in as fair a way as possible and the sub-areas are measured using a gadget called a planimeter.

For a more formal analysis, minimum distance methods may be used (Section 4.5) and a modified type of Fourier analysis is also available, thanks largely to Medgyessy (1977).

This approach is stimulated by the following obvious statement about curves like the p.d.f. corresponding to Example 2. Suppose we let

$$p(x \mid \psi, \lambda) = \pi_1 \sigma_{1\lambda}^{-1} \phi\big((x - \mu_1)/\sigma_{1\lambda}\big) + (1 - \pi_1) \sigma_{2\lambda}^{-1} \phi\big((x - \mu_2)/\sigma_{2\lambda}\big), \qquad (18)$$

where $\sigma_{j\lambda}^2 = \sigma_j^2 - \lambda$, $j = 1, 2$, and $0 < \lambda < \min(\sigma_1^2, \sigma_2^2)$.

As $\lambda$ increases from zero, the mixture becomes more and more clearly bimodal and the parameters become easier and easier to estimate from the curve. By operating mathematically in a specified way on the datum curve it is indeed possible to draw data-based versions of (18) and, thence, to decompose the mixture. Medgyessy (1977) gives details for both continuous and discrete data. Stanat (1966) gives multivariate versions. Gregor (1969) applies the procedure to histogram data and Tarter and Silvers (1975) decompose bivariate Normal mixtures in a rather similar manner.
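The effect of $\lambda$ in (18) can be seen on a model density, without any data. The sketch below is illustrative only: the function names and the parameter values ($\pi_1 = 1/2$, means 0 and 1.8, unit variances) are assumptions, and it does not implement Medgyessy's operation on the datum curve. It evaluates (18) on a grid and counts local maxima.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def sharpened_density(x, pi1, mu1, var1, mu2, var2, lam):
    """Density (18): both component variances are reduced by lam."""
    if not 0.0 <= lam < min(var1, var2):
        raise ValueError("lam must satisfy 0 <= lam < min(var1, var2)")
    s1 = math.sqrt(var1 - lam)
    s2 = math.sqrt(var2 - lam)
    return pi1 * normal_pdf(x, mu1, s1) + (1.0 - pi1) * normal_pdf(x, mu2, s2)


def count_modes(density, lo, hi, steps=2000):
    """Count strict local maxima of density on an equally spaced grid."""
    ys = [density(lo + (hi - lo) * k / steps) for k in range(steps + 1)]
    return sum(1 for k in range(1, steps) if ys[k - 1] < ys[k] > ys[k + 1])


def modes_after_sharpening(lam):
    """Number of modes of (18) for the illustrative parameter values."""
    return count_modes(
        lambda x: sharpened_density(x, 0.5, 0.0, 1.0, 1.8, 1.0, lam), -4.0, 6.0)
```

With these values the two means are 1.8 standard deviations apart at $\lambda = 0$, so the curve has a single mode. At $\lambda = 0.5$ the common standard deviation falls to about 0.71, the separation exceeds two standard deviations and two modes appear.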
5. DISCRIMINANT ANALYSIS


In usual discriminant analysis there are training sets of categorised observations from $k$ sources; from these data a procedure is developed for assessing the possible source of a further observation, $x$, say, the source of which is unknown. Our questions here are whether further uncategorised data can be built into the discrimination procedure and whether the discriminatory performance is improved thereby. In the limiting case only uncategorised data are available from the start (M0 data). (In practice it may be expensive or, in some medical contexts, dangerous to obtain enough information to fully categorise an experimental unit. If therefore uncategorised data are useful as such, this could be very welcome.)

Whether or not uncategorised data are useful at all depends critically on the model chosen for the joint probability density

$$p(x, y)$$

of $x$ and the source identifier $y$. We may write either

$$p(x, y) = p(x)\, p(y \mid x) \qquad (D)$$

or

$$p(x, y) = p(x \mid y)\, p(y), \qquad (S)$$

in which (D) recognises the diagnostic paradigm and (S) the sampling paradigm of Dawid (1976).

In discriminant analysis we are interested in using the training data to tell us about $p(y \mid x)$. If a parametric version of (D) is set up such that the parameters associated with the two factors on the right hand side are distinct, then no amount of uncategorised data gives any information at all about $p(y \mid x)$. If (S) is used similarly, however, and we obtain, by Bayes' Theorem,

$$p(y \mid x) = p(y)\, p(x \mid y) / p(x),$$

where $p(x)$ is a mixture density, as in Section 1, then the uncategorised data will affect the discrimination procedure and its performance. In particular, as the amount of uncategorised data available increases, $p(y \mid x)$ should be estimated consistently. Discriminant rules based on estimated likelihood ratios will tend to the optimal rule (in terms of misclassification rates, that is).

Example 2 (Restriction: $\sigma_1 = \sigma_2 = \sigma$)

In this case the likelihood ratio rule can be written in terms of a discriminant function that is linear in $x$ and depends on the unknown parameters (Lachenbruch, 1975). These parameters may be estimated from mixture data, with or without supplementary categorised data sets, using, for instance, the EM algorithm of Section 4.4. Performance may be assessed either empirically or by considering the asymptotic expected rate of misclassification. O'Neill (1978) and Ganesalingam and McLachlan (1978), in almost simultaneous publications, showed that the mixture data can help in this context, although the two Normal components have to be rather well separated for the effect to be substantial. Let $\Delta = |\mu_1 - \mu_2| / \sigma$ and suppose $\pi_1 = 1/2$. Then, relative to the case in which all data are categorised, the asymptotic efficiencies for M0 data and for M2 data with 50% categorised data are, respectively, 10% and 50% (for $\Delta = 2$); 65% and 83% (for $\Delta = 4$).
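For reference, the linear discriminant function in the equal-variance case is $\log\{\pi_1 f_1(x) / (\pi_2 f_2(x))\} = \log(\pi_1/\pi_2) + \sigma^{-2} (\mu_1 - \mu_2) \{x - (\mu_1 + \mu_2)/2\}$, and with $\pi_1 = 1/2$ and known parameters the optimal misclassification rate is $\Phi(-\Delta/2)$. The sketch below evaluates both. It is illustrative only: the function names are hypothetical, and the parameters are treated as known, whereas in practice estimates from the EM algorithm would be plugged in.

```python
import math


def normal_cdf(z):
    """Standard Normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def discriminant(x, pi1, mu1, mu2, sigma):
    """log{pi_1 f_1(x) / (pi_2 f_2(x))} for a common sigma; linear in x."""
    return (math.log(pi1 / (1.0 - pi1))
            + (mu1 - mu2) / sigma ** 2 * (x - 0.5 * (mu1 + mu2)))


def prob_source_1(x, pi1, mu1, mu2, sigma):
    """p(y = 1 | x), obtained from the discriminant by a logistic transform."""
    return 1.0 / (1.0 + math.exp(-discriminant(x, pi1, mu1, mu2, sigma)))


def optimal_error_rate_equal_priors(delta):
    """Misclassification rate of the optimal rule when pi_1 = 1/2."""
    return normal_cdf(-0.5 * delta)
```

Under these assumptions the optimal error rate is about 0.159 for $\Delta = 2$ and about 0.023 for $\Delta = 4$, which gives some feel for what "rather well separated" means.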

Empirical evidence of the gains from an approximate Bayesian version of the multivariate Normal case is given by Titterington (1976), and Anderson (1979) combines the paradigms (D) and (S) by parametrising according to the factorisation (S) and then analysing the data by logistic methods, which are diagnostic in spirit. Silverman (1978) estimates likelihood ratios nonparametrically using data some of which are uncategorised.

It is clearly disturbing that the two parametrisations based on (D) and (S) lead to qualitatively different results in the present context. If (D) is used wrongly then potentially useful information is not being used; if (S) is used wrongly then the bonus the mixture data appear to offer is misleading. It makes it important to find the right model in any given application and, needless to say, it has led to considerable controversy about principle.
6. HYPOTHESIS TESTING AND MULTIMODALITY

In Section 4 we mentioned cluster analysis as a means of establishing the number of components present in a mixture when we only have uncategorised data. Alternatively, we may use likelihood criteria with added penalties for the number of parameters involved (Akaike, 1973; Schwarz, 1978). Another possibility is to seek the mixture with fewest components which is still compatible with the data. In particular, we may want to ask whether there really is a mixture or whether there is just a single underlying component. This could be just the sort of question we want to ask in practice. We seem to be on well-trodden ground, if parametric models may be assumed, because the problem can be formulated as one of testing between two nested hypotheses, for which the generalised likelihood ratio test is available. However, we soon hit snags.

Example 2

H0: Single Normal.

H1: Mixture of two Normals.

We would hope to evaluate the usual "$2 \log \lambda$" test statistic and refer its value to a percentile of a $\chi^2$ distribution with some number, $\nu$, of degrees of freedom. What, however, should $\nu$ be? In most problems, $\nu$ is obtained as the number of constraints required to reduce H1 to H0. We may obtain this reduction here, however, by either: (i) $\pi_1 = 1$ (1 constraint), or (ii) $\mu_1 = \mu_2$, $\sigma_1 = \sigma_2$ (2 constraints).

Should we take $\nu = 1$ or $\nu = 2$ or maybe some intermediate value, as conjectured by Hartigan (1977)?

Should we even be trying to use the $\chi^2$ table at all?

Example 1

H0: $\pi_1 = 1$.

H1: $0 < \pi_1 < 1$.

Here asymptotically, under H0, the maximum likelihood estimator for $\pi_1$ in the H1 model is equal to 1 with probability 1/2 (Section 4.4). Thus $2 \log \lambda$ is zero, with probability 1/2, and therefore certainly not $\chi^2$.
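For Example 1 the limiting behaviour can be stated explicitly. The following is the standard result for a single parameter on the boundary of its range, given here as background under the usual regularity conditions on $f_1$ and $f_2$ and not as something derived in this paper. Writing $\chi^2_0$ for the distribution degenerate at zero,

$$2 \log \lambda \;\xrightarrow{d}\; \tfrac{1}{2} \chi^2_0 + \tfrac{1}{2} \chi^2_1, \qquad \text{so that} \qquad \Pr(2 \log \lambda > c) \to \tfrac{1}{2} \Pr(\chi^2_1 > c), \quad c > 0.$$

On this basis a nominal 5% test would refer $2 \log \lambda$ to the upper 10% point of $\chi^2_1$, roughly 2.71, instead of the 5% point, 3.84.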
The problem is that the regularity conditions required for the asymptotic theory do not hold (cf. Section 4.4). Under H0, the true value of $\pi_1$ lies on the boundary of its parameter space. In Example 2, H0 also corresponds to a region on the boundary of the parameter space for H1, a region in which identifiability fails.

So far, the treatment that has been developed for this important difficulty is far from satisfactory. Until recently, in many applications the $\chi^2$ approximation has been used without the awkwardness about degrees of freedom being detected. For Example 2 and multivariate versions thereof, some simulations have been carried out in attempts to concoct a number of degrees of freedom to use in the $\chi^2$ table; see Wolfe (1972), Aitkin et al. (1981) and Everitt and Hand (1981). Hardly any theoretical work has been reported. Davies (1977) mentions, but does not work through in detail, the use of a union-intersection principle for one special example.

Alternative test procedures are themselves somewhat unsatisfactory. Engelman and Hartigan (1969) use, as test statistic for Example 2, the estimated Mahalanobis distance corresponding to the optimal "maximum likelihood" clustering of the data into two components (end of Section 4.4). Also for Example 2, omnibus tests of Normality could be used.

A final possibility is to look at the degree of multimodality represented by the data. Of course, unimodality of a density is not equivalent to its corresponding to a single component density. Indeed, a symmetric mixture of two univariate Normals ($\pi_1 = 1/2$, $\sigma_1 = \sigma_2 = \sigma$) is only bimodal if $|\mu_1 - \mu_2| > 2\sigma$. The study of bimodal and multimodal densities is, however, of some interest and the two component Normal mixture is a convenient model for a bimodal density. (An alternative one with one fewer parameter is the quartic exponential density; see Matz, 1978.) Many papers, especially in fields of application, talk specifically about multimodality. What they are usually interested in, however, is the possible presence of a mixture (Murphy, 1964). What is possible, however, is to use a significance test against unimodality as a conservative test against the hypothesis of a one-component distribution. At least the asymptotic theory will not cause such problems.
Silverman (1981) has developed a technique, based on simulation and nonparametric density estimation, for assessing the modality of a data-set.

7.  CLOSING  REMARKS 

It is hoped that we have done justice to the variety of applications and special problems that arise with distribution mixtures and that it is clear that the thorniest problems await satisfactory solution. The field is very much alive and, if anything, the publication rate on this topic is higher than ever.

We have not been able to give many details of analysis, nor even to provide anything like a full list of references. Several papers have been written which contain survey or bibliographic material; see Blischke (1983), Clark (1976), Macdonald and Pitcher (1979), Odell and Basu (1976) and Murray and Titterington (1978). Further reference may be made to sporadic sections in the quartet of books by Johnson and Kotz (1969-72), to Chapter 4 of Ord (1972) and to the recent monograph by Everitt and Hand (1981). The present contribution arose from work towards a forthcoming book by Makov et al. (1982) where, it is hoped, the missing details and references will be fully documented. In particular, a much fuller account of the sequential methods of Section 4.7 will be provided.